A comparison of Bayesian classification trees and random forest to identify classifiers for childhood leukaemia

Authors

  • K. W. Carter
  • M. J. Firth
  • N. H. de Klerk
Abstract

Recently, microarray technologies have been used extensively to profile gene expression in acute lymphoblastic leukaemia (ALL) (e.g. Pui et al., 2004; Hoffmann et al., 2008). ALL is the most common type of leukaemia diagnosed in children, with an incidence rate of about 4 per 100,000 per year (Pizzo and Poplack, 2001; Milne et al., 2008). There are six main subtypes of leukaemia, one of which is T-cell acute lymphoblastic leukaemia (T-ALL), which generally has lower cure rates than other forms of ALL. Ribonucleic acid (RNA) samples from each patient can be put onto microarrays to provide gene expression levels for around 20 thousand genes (depending on which microarray chip is used). One of the challenges with microarray analysis in leukaemia research is identifying the smallest possible set of genes that predicts relapse with the highest predictive performance. Currently, one approach used to identify important differentially expressed genes is Random Forest (RF) (e.g. Hoffmann, 2006; Díaz-Uriarte and Alvarez de Andrés, 2006). RF is a classifier that consists of an ensemble of classification trees and assigns each observation (each patient) the class obtained by aggregating the predictions of the individual trees. Díaz-Uriarte and Alvarez de Andrés (2006) identified the characteristics that make RF ideal for microarray data. These include: RF can handle more variables than observations (large p small n problems); RF can be applied to binary and multi-class problems; RF has good predictive performance for datasets containing a large number of noise variables and does not overfit; RF can use both categorical and continuous predictors and investigates interactions; the results from RF are unaltered by monotone transformations of the variables; a free R library exists that performs RF; RF provides measures of variable importance; and, for the most part, one does not have to fine-tune parameters to obtain good predictive performance.
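The ensemble idea described above can be illustrated with a minimal sketch. It assumes each tree has already produced a class label for every patient and combines the labels by majority vote. The function name `majority_vote` and the toy labels are hypothetical and are not taken from the paper or from any RF library.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree class labels into one label per patient.

    tree_predictions[t][i] is the class that tree t assigns to patient i.
    This is an illustrative helper, not part of any RF implementation.
    """
    n_patients = len(tree_predictions[0])
    ensemble = []
    for i in range(n_patients):
        # Count how many trees voted for each class for patient i
        votes = Counter(tree[i] for tree in tree_predictions)
        # Keep the class with the most votes
        ensemble.append(votes.most_common(1)[0][0])
    return ensemble


# Hypothetical votes from three trees for four patients (invented data)
votes = [
    ["relapse", "no relapse", "no relapse", "relapse"],
    ["relapse", "no relapse", "relapse", "no relapse"],
    ["no relapse", "no relapse", "relapse", "relapse"],
]
print(majority_vote(votes))
```

An odd number of trees is used here so that a two-class vote cannot tie; a real forest would also grow each tree on a bootstrap sample with a random subset of genes considered at each split.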
This paper describes an alternative approach to identifying a gene classifier for predicting relapse in ALL. Bayesian approaches to classification and regression trees (BCART) were proposed by Chipman et al. (1998), Denison et al. (1998) and Buntine (1992). BCART identifies “good” trees using a stochastic search algorithm that applies a reversible jump Markov chain Monte Carlo method. The best trees are then selected as those with the highest prediction accuracy (O’Leary et al., 2008). Fan and Gray (2005) gave BCART an A+ for interpretability and a B+ for prediction. To date, BCART has been largely based on “noninformative”, usually conjugate, priors. Moreover, there are only a few real-world applications of BCART (Lamon & Stow, 2004; Partridge et al., 2006; Schetinin et al., 2007). To the authors’ knowledge, this statistical approach has not been applied to large p small n problems. Here we compare RF and BCART for predicting relapse in three ALL datasets, using gene expression values as the covariates. In all three datasets, the best tree identified by BCART had better accuracy and, in particular, better prediction of relapse (higher sensitivity) than RF. BCART also performed better than RF in identifying important genes that predict whether a patient will relapse.
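The comparison above rests on accuracy and sensitivity, where sensitivity is the proportion of patients who truly relapsed that the classifier predicted as relapsing. The following sketch shows how both could be computed from true and predicted labels. The function names and the six-patient data are hypothetical and do not reproduce any result from the paper.

```python
def confusion_counts(y_true, y_pred, positive="relapse"):
    """Return (tp, fp, tn, fn), treating `positive` as the class of interest."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1  # relapse correctly predicted
        elif truth == positive:
            fn += 1  # relapse missed
        elif pred == positive:
            fp += 1  # relapse predicted but did not occur
        else:
            tn += 1  # non-relapse correctly predicted
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Proportion of true relapses that were predicted as relapse."""
    return tp / (tp + fn)


def accuracy(tp, fp, tn, fn):
    """Proportion of all patients classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Invented labels for six patients, for illustration only
y_true = ["relapse", "relapse", "no relapse", "no relapse", "no relapse", "relapse"]
y_pred = ["relapse", "no relapse", "no relapse", "no relapse", "relapse", "relapse"]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("sensitivity:", sensitivity(tp, fn))
print("accuracy:", accuracy(tp, fp, tn, fn))
```

Reporting sensitivity alongside accuracy matters when relapse is the rarer class, because a classifier can reach high accuracy while missing most relapses.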

Similar resources

Author gender identification from text using Bayesian Random Forest

The heavy use of virtual environments, and the connections users form through social networks such as Facebook, Instagram, and Twitter, make it more necessary than before to identify shared subjects in this environment. Several applications benefit from reliable methods for inferring the age and gender of social media users. Such applications exist across a wide area of fields,...

Comparison of Machine Learning Algorithms for Broad Leaf Species Classification Using UAV-RGB Images

Abstract: Knowing the tree species composition of forests provides valuable information for studying the forest’s economic value, fire risk assessment, biodiversity monitoring, and wildlife habitat improvement. Fieldwork is often time-consuming and labor-intensive, free satellite data are available only at coarse resolution, and the use of manned aircraft is relatively costly. Recently, unmanned aeria...

Ensemble Classification and Extended Feature Selection for Credit Card Fraud Detection

Due to the rise of technology, the possibility of fraud in different areas such as banking has increased. Credit card fraud is a crucial problem in banking and its danger is ever increasing. This paper proposes an advanced data mining method, considering both feature selection and decision cost, to enhance the accuracy of credit card fraud detection. After selecting the best and most effec...

Multispectral Image Analysis Using Random Forest

Classical methods for classification of pixels in multispectral images include supervised classifiers such as the maximum-likelihood classifier, neural network classifiers, fuzzy neural networks, support vector machines, and decision trees. Recently, there has been an increase of interest in ensemble learning – a method that generates many classifiers and aggregates their results. Breiman propo...

Classifier Ensemble Based Class Weightening

Many methods have been proposed for combining multiple classifiers in pattern recognition such as Random Forest which uses decision trees for problem solving. In this paper, we propose a weighted vote-based classifier ensemble method. The proposed method is similar to Random Forest method in employing many decision trees and neural networks as classifiers. For evaluating the proposed weighting ...

Publication date: 2009